{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Copyright 2019 Google LLC\n",
    "#\n",
    "# Licensed under the Apache License, Version 2.0 (the \"License\");\n",
    "# you may not use this file except in compliance with the License.\n",
    "# You may obtain a copy of the License at\n",
    "#\n",
    "# https://www.apache.org/licenses/LICENSE-2.0\n",
    "#\n",
    "# Unless required by applicable law or agreed to in writing, software\n",
    "# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
    "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
    "# See the License for the specific language governing permissions and\n",
    "# limitations under the License."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Overview\n",
    "This notebook uses the [Census Income Data Set](https://archive.ics.uci.edu/ml/datasets/Census+Income) to demonstrate how to train a model and generate local predictions.\n",
    "\n",
    "\n",
    "##  The data\n",
    "The [Census Income Data Set](https://archive.ics.uci.edu/ml/datasets/Census+Income) that this sample\n",
    "uses for training is provided by the [UC Irvine Machine Learning\n",
    "Repository](https://archive.ics.uci.edu/ml/datasets/). Google has hosted the data on a public GCS bucket `gs://cloud-samples-data/ml-engine/sklearn/census_data/` and also hosted in the UC Irvine dataset repository.\n",
    "\n",
    " * Training file is `adult.data`\n",
    " * Evaluation file is `adult.test`\n",
    "\n",
    "Note: Your typical development process with your own data would require you to upload your data to GCS so that you can access that data from inside your notebook. However, in this case, Google has put the data on GCS to avoid the steps of having you download the data from UC Irvine and then upload the data to GCS.\n",
    "\n",
    "### Disclaimer\n",
    "This dataset is provided by a third party. Google provides no representation,\n",
    "warranty, or other guarantees about the validity or any other aspects of this dataset.\n",
    "\n",
    "# Build your model\n",
    "\n",
    "First, you'll create the model (provided below). This is similar to your normal process for creating a scikit-learn model. However, there is one key difference:\n",
    "\n",
    "1. Downloading the data at the start of your file, so that you can access the data.\n",
    "\n",
    "The code in this file loads the data into a pandas DataFrame that can be used by scikit-learn. Then the model is fit against the training data. Lastly, sklearn's built in version of joblib is used to save the model to a file that can be uploaded to [AI Platform's prediction service](https://cloud.google.com/ml-engine/docs/scikit/getting-predictions#deploy_models_and_versions)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import json\n",
    "\n",
    "from sklearn.ensemble import RandomForestClassifier\n",
    "from sklearn.externals import joblib\n",
    "from sklearn.feature_selection import SelectKBest\n",
    "from sklearn.pipeline import FeatureUnion\n",
    "from sklearn.pipeline import Pipeline\n",
    "from sklearn.preprocessing import LabelBinarizer"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Add code to download the data (in this case, using the publicly hosted data).\n",
    "you will then be able to use the data when training your model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Download the data\n",
    "! curl https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data --output adult.data\n",
    "! curl https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test --output adult.test"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Read in the data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Define the format of your input data including unused columns (These are the columns from the census data files)\n",
    "COLUMNS = (\n",
    "    'age',\n",
    "    'workclass',\n",
    "    'fnlwgt',\n",
    "    'education',\n",
    "    'education-num',\n",
    "    'marital-status',\n",
    "    'occupation',\n",
    "    'relationship',\n",
    "    'race',\n",
    "    'sex',\n",
    "    'capital-gain',\n",
    "    'capital-loss',\n",
    "    'hours-per-week',\n",
    "    'native-country',\n",
    "    'income-level'\n",
    ")\n",
    "\n",
    "# Categorical columns are columns that need to be turned into a numerical value to be used by scikit-learn\n",
    "CATEGORICAL_COLUMNS = (\n",
    "    'workclass',\n",
    "    'education',\n",
    "    'marital-status',\n",
    "    'occupation',\n",
    "    'relationship',\n",
    "    'race',\n",
    "    'sex',\n",
    "    'native-country'\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Load the training census dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "with open('./adult.data', 'r') as train_data:\n",
    "    raw_training_data = pd.read_csv(train_data, header=None, names=COLUMNS)\n",
    "\n",
    "# Remove the column you are trying to predict ('income-level') from our features list\n",
    "# Convert the Dataframe to a lists of lists\n",
    "train_features = raw_training_data.drop('income-level', axis=1).values.tolist()\n",
    "# Create our training labels list, convert the Dataframe to a lists of lists\n",
    "train_labels = (raw_training_data['income-level'] == ' >50K').values.tolist()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Load the test census dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "with open('./adult.test', 'r') as test_data:\n",
    "    raw_testing_data = pd.read_csv(test_data, names=COLUMNS, skiprows=1)\n",
    "# Remove the column we are trying to predict ('income-level') from our features list\n",
    "# Convert the Dataframe to a lists of lists\n",
    "test_features = raw_testing_data.drop('income-level', axis=1).values.tolist()\n",
    "# Create our training labels list, convert the Dataframe to a lists of lists\n",
    "test_labels = (raw_testing_data['income-level'] == ' >50K.').values.tolist()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This is where your model code would go. Below is an example model using the census dataset.\n",
    "Since the census data set has categorical features, you need to convert\n",
    "them to numerical values. You'll use a list of pipelines to convert each\n",
    "categorical column and then use FeatureUnion to combine them before calling the RandomForestClassifier.\n",
    "\n",
    "Each categorical column needs to be extracted individually and converted to a numerical value.\n",
    "To do this, each categorical column will use a pipeline that extracts one feature column via\n",
    " `SelectKBest(k=1) and a LabelBinarizer()` to convert the categorical value to a numerical one.\n",
    "A scores array (created below) will select and extract the feature column. The scores array is\n",
    "created by iterating over the COLUMNS and checking if it is a CATEGORICAL_COLUMN."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "categorical_pipelines = []\n",
    "\n",
    "for i, col in enumerate(COLUMNS[:-1]):\n",
    "    if col in CATEGORICAL_COLUMNS:\n",
    "        # Create a scores array to get the individual categorical column.\n",
    "        # Example:\n",
    "        #  data = [39, 'State-gov', 77516, 'Bachelors', 13, 'Never-married', 'Adm-clerical', \n",
    "        #         'Not-in-family', 'White', 'Male', 2174, 0, 40, 'United-States']\n",
    "        #  scores = [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]\n",
    "        #\n",
    "        # Returns: [['State-gov']]\n",
    "        # Build the scores array\n",
    "        scores = [0] * len(COLUMNS[:-1])\n",
    "        # This column is the categorical column you want to extract.\n",
    "        scores[i] = 1\n",
    "        skb = SelectKBest(k=1)\n",
    "        skb.scores_ = scores\n",
    "        # Convert the categorical column to a numerical value\n",
    "        lbn = LabelBinarizer()\n",
    "        r = skb.transform(train_features)\n",
    "        lbn.fit(r)\n",
    "        # Create the pipeline to extract the categorical feature\n",
    "        categorical_pipelines.append(\n",
    "            ('categorical-{}'.format(i), Pipeline([\n",
    "                ('SKB-{}'.format(i), skb),\n",
    "                ('LBN-{}'.format(i), lbn)])))\n",
    "\n",
    "# Create pipeline to extract the numerical features\n",
    "skb = SelectKBest(k=6)\n",
    "# From COLUMNS use the features that are numerical\n",
    "skb.scores_ = [1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0]\n",
    "categorical_pipelines.append(('numerical', skb))\n",
    "\n",
    "# Combine all the features using FeatureUnion\n",
    "preprocess = FeatureUnion(categorical_pipelines)\n",
    "\n",
    "# Create the classifier\n",
    "classifier = RandomForestClassifier()\n",
    "\n",
    "# Transform the features and fit them to the classifier\n",
    "classifier.fit(preprocess.transform(train_features), train_labels)\n",
    "\n",
    "# Create the overall model as a single pipeline\n",
    "pipeline = Pipeline([\n",
    "    ('union', preprocess),\n",
    "    ('classifier', classifier)\n",
    "])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Export the model to a file"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model = 'model.joblib'\n",
    "joblib.dump(pipeline, model)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!ls -al model.joblib"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Predictions\n",
    "Get one person that makes <=50K and one that makes >50K to test our model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print('Show a person that makes <=50K:')\n",
    "print('\\tFeatures: {0} --> Label: {1}\\n'.format(test_features[0], test_labels[0]))\n",
    "\n",
    "with open('less_than_50K.json', 'w') as outfile:\n",
    "  json.dump(test_features[0], outfile)\n",
    "\n",
    "print('Show a person that makes >50K:')\n",
    "print('\\tFeatures: {0} --> Label: {1}'.format(test_features[3], test_labels[3]))\n",
    "\n",
    "with open('more_than_50K.json', 'w') as outfile:\n",
    "  json.dump(test_features[3], outfile)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Use Python to make local predictions\n",
    "Test the model with the entire test set and print out some of the results."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "local_results = pipeline.predict(test_features)\n",
    "local = pd.Series(local_results, name='local')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "local[:10]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Print the first 10 responses\n",
    "for i, response in enumerate(local[:10]):\n",
    "    print('Prediction: {}\\tLabel: {}'.format(response, test_labels[i]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# [Optional] Verify Results\n",
    "Use a confusion matrix to create a visualization of the local predicted results."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "actual = pd.Series(test_labels, name='actual')\n",
    "local_predictions = pd.Series(local_results, name='local')\n",
    "\n",
    "pd.crosstab(actual, local_predictions)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
